This document records my work for week 3 of the regression course offered as part of the Machine Learning Specialization by the University of Washington via Coursera.

Obtaining and cleaning the data

As before, the data were obtained in the form of .zip files downloaded from the quiz webpage. These were stored in a directory on my desktop and accessed with the following commands:

rm(list = ls())
setwd("~/Desktop/Coursera-MLS-Multiple-regression")
# Obtain the full data set, the training and the testing data
allData <- read.csv(unzip(zipfile="./Assessing_Performance/kc_house_data.csv.zip"),
                    header = T, 
                    sep = ",", 
                    quote = " ", 
                    stringsAsFactors = T )
train_data <- read.csv(unzip(zipfile="./Assessing_Performance/kc_house_train_data.csv.zip"),
                       header = T, 
                       sep = ",", 
                       quote = " ", 
                       stringsAsFactors = T )
test_data <- read.csv(unzip(zipfile="./Assessing_Performance/kc_house_test_data.csv.zip"),
                       header = T, 
                       sep = ",", 
                       quote = " ", 
                       stringsAsFactors = T )

As before, the instructions for this assignment require that each column have the following classes: str, str, float, float, float, float, int, str, int, int, int, int, int, int, int, int, str, float, float, float, float. These were amended appropriately.
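One way to amend the classes is to map the assignment's str/float/int labels to character/numeric/integer (coerce_classes is a sketch of such a helper, not the code actually used):

```r
# Coerce each column to the class required by the assignment
# (str -> character, float -> numeric, int -> integer)
coerce_classes <- function(df, classes){
        stopifnot(ncol(df) == length(classes))
        for (i in seq_along(classes)){
                df[[i]] <- switch(classes[i],
                                  str   = as.character(df[[i]]),
                                  float = as.numeric(df[[i]]),
                                  int   = as.integer(df[[i]]))
        }
        df
}

# e.g. train_data <- coerce_classes(train_data, classes), where classes is
# the 21-element vector listed above
```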

Exploratory data analysis

Once again, the data provided are already well formatted:

c(dim(train_data), dim(test_data))
## [1] 17384    21  4229    21
nrow(train_data[!complete.cases(train_data),])
## [1] 0
nrow(test_data[!complete.cases(test_data),])
## [1] 0

In each data set, there are 21 columns. There are 17384 rows in the training set, and 4229 rows in the test set. There are no missing values in either data set.

Write a function called ‘polynomial_sframe’ (or otherwise) which accepts an array ‘feature’ and a maximal ‘degree’ and returns a data frame with the first column equal to ‘feature’ and the remaining columns equal to ‘feature’ raised to increasing integer powers up to ‘degree’.

## poly_dataframe() takes a data frame, the (quoted) name of one of its columns, and a degree, and returns a data frame whose i-th column holds that feature raised to the i-th power, for i = 1, ..., degree
poly_dataframe <- function(dataframe, feature, degree){
        # Fill a matrix whose i-th column is feature^i, then convert it
        # to a data frame once and name the columns power_1, ..., power_degree
        poly <- matrix(nrow = nrow(dataframe), ncol = degree)
        for (i in 1:degree){
                poly[, i] <- dataframe[[feature]]^i
        }
        poly <- as.data.frame(poly)
        colnames(poly) <- paste("power_", 1:degree, sep = "")
        poly
}

Load in the data and also sort the sales SFrame by ‘sqft_living’. When we plot the fitted values we want to join them up in a line and this works best if the variable on the X-axis (which will be ‘sqft_living’) is sorted. For houses with identical square footage, we break the tie by their prices.

This is quite interesting. The number of houses listed is

length(train_data$sqft_living)
## [1] 17384

whereas the number of unique values of sqft_living is only about 5.5% of that:

length(unique(train_data$sqft_living))
## [1] 957

I feel that the instructions here could be better. Are we looking for only the maximal prices for a given sqft_living or are we sorting the data frame both by sqft_living and prices?

The sample python code suggests the latter:

sales = graphlab.SFrame('kc_house_data.gl/')
sales = sales.sort(['sqft_living','price'])

which is achieved with the function:

tie_Break <- function(dataframe, feature, output){
        # Sort by the feature, breaking ties by the output column
        dataframe[order(dataframe[[feature]], dataframe[[output]]), ]
}
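A toy example (made-up values) of the tie-breaking behaviour:

```r
# Houses with equal sqft_living are ordered by price
toy <- data.frame(sqft_living = c(500, 400, 500),
                  price = c(300, 100, 200))
toy[order(toy$sqft_living, toy$price), ]  # row order becomes 2, 3, 1
```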

but I am not happy with not knowing what is really being asked for.

Make a 1 degree polynomial SFrame with sales[‘sqft_living’] as the feature. Call it ‘poly1_data’.

sales <- tie_Break(dataframe = train_data, 
                   feature = "sqft_living", 
                   output = "price")
poly1_data <- poly_dataframe(dataframe = sales, 
                             feature = "sqft_living", 
                             degree = 1)

Add sales[‘price’] to poly1_data, as this will be our output variable:

poly1_data <- cbind(poly1_data, sales$price)
colnames(poly1_data)[2] <- "price"
head(poly1_data)
##   power_1  price
## 1     290 142000
## 2     380 245000
## 3     384 265000
## 4     390 228000
## 5     420 229050
## 6     430  80000

Use graphlab.linear_regression.create (or another linear regression library) to compute the regression weights for predicting sales[‘price’] based on the 1 degree polynomial feature ‘sqft_living’

model1 <- lm(formula = price ~ power_1, data = poly1_data)

Next, produce a scatter plot of the training data (just square feet vs. price) and add the fitted model.

plot(x = poly1_data$power_1, 
     y = poly1_data$price, 
     xlab = "Square Feet", 
     ylab = "House Price", 
     main = "Comparison of Polynomial fits")
legend("topleft", 
       lwd=c(2,2.5), 
       col = c("red"), 
       legend = c("model1"))
abline(model1, lwd=3, col="red")

Now that you have plotted the results using a 1st degree polynomial, try it again using a 2nd degree and 3rd degree polynomial. Look at the fitted lines, do they appear as you would expect?

poly2_data <- poly_dataframe(dataframe = sales, 
                             feature = "sqft_living", 
                             degree = 2)
poly2_data <- cbind(poly2_data, sales$price)
colnames(poly2_data)[3] <- "price"
model2 <- lm(formula = price ~ ., data = poly2_data)
plot(x = poly1_data$power_1, 
     y = poly1_data$price, 
     xlab = "Square Feet", 
     ylab = "House Price", 
     main = "Comparison of Polynomial fits")
legend("topleft", 
       lwd=c(2,2.5), 
       col = c("red", "blue"), 
       legend = c("model1", "model2"))
abline(model1, lwd=3, col="red")
# abline() only draws straight lines, so the quadratic's fitted values are added to the plot with points(type = "l") instead
points(x = poly1_data$power_1,
       y = fitted(model2), 
       col="blue", 
       pch = 20, 
       type = "l", 
       lwd=3)

We want to compare our fit to one generated by the poly() function. By default poly() builds orthogonal polynomials rather than raw powers, so its coefficients differ from model2's, but the two parameterisations span the same column space and therefore give identical fitted values.

polyfit <- lm(formula = price~poly(x = power_1, 2), 
              data = poly1_data)
plot(x = poly1_data$power_1,
     y = poly1_data$price)
points(x = poly1_data$power_1, 
       y = fitted(model2), 
       col="blue", 
       pch = 20, 
       type = "l", 
       lwd=3)
points(x = poly1_data$power_1,
       y = fitted(polyfit), 
       col="dark green", 
       pch = 20, 
       type = "l", 
       lwd=3)

And it’s a match!
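This equivalence can also be checked numerically on synthetic data (a small sketch; x and y here are made up):

```r
# Raw powers and poly()'s orthogonal basis give the same fitted values
set.seed(1)
x <- runif(100, 0, 10)
y <- 2 + 0.5 * x + 0.1 * x^2 + rnorm(100)
raw_fit  <- lm(y ~ x + I(x^2))
orth_fit <- lm(y ~ poly(x, 2))
all.equal(fitted(raw_fit), fitted(orth_fit))  # TRUE
```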

poly3_data <- poly_dataframe(dataframe = sales, 
                             feature = "sqft_living", 
                             degree = 3)
poly3_data <- cbind(poly3_data, sales$price)
colnames(poly3_data)[4] <- "price"
model3 <- lm(formula = price ~ ., data = poly3_data)
plot(x = poly1_data$power_1, 
     y = poly1_data$price, 
     xlab = "Square Feet", 
     ylab = "House Price", 
     main = "Comparison of Polynomial fits")
legend("topleft", 
       lwd=c(2,2.5), 
       col = c("red", "blue", "dark green"), 
       legend = c("model1", "model2", "model3"))
abline(model1, lwd=3, col="red")
points(x = poly1_data$power_1, 
       y = fitted(model2), 
       col="blue", 
       pch = 20, 
       type = "l", 
       lwd=3)
points(x = poly1_data$power_1, 
       y = fitted(model3), 
       col="dark green", 
       pch = 20, 
       type = "l", 
       lwd=3)

Now try a 15th degree polynomial. Print out the coefficients and look at the resulting fitted line. Do you think this degree is appropriate for these data?

poly15_data <- poly_dataframe(dataframe = sales, 
                             feature = "sqft_living", 
                             degree = 15)
poly15_data <- cbind(poly15_data, sales$price)
colnames(poly15_data)[16] <- "price"
model15 <- lm(formula = price ~ ., data = poly15_data)
plot(x = poly1_data$power_1, 
     y = poly1_data$price, 
     xlab = "Square Feet",
     ylab = "House Price", 
     main = "Comparison of Polynomial fits")
legend("topleft", 
               lwd=c(2,2.5), 
               col = c("red", "blue", "dark green", "purple"), 
               legend = c("model1", "model2", "model3", "model15"))
abline(model1, lwd=3, col="red")
points(x = poly1_data$power_1, 
       y = fitted(model2), 
       col="blue", 
       pch = 20, 
       type = "l", 
       lwd=3)
points(x = poly1_data$power_1, 
       y = fitted(model3), 
       col="dark green", 
       pch = 20, 
       type = "l", 
       lwd=3)
points(x = poly1_data$power_1, 
       y = fitted(model15), 
       col="purple", 
       pch = 20, 
       type = "l", 
       lwd=3)

summary(model15)
## 
## Call:
## lm(formula = price ~ ., data = poly15_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2101916  -139284   -27346    99130  2488996 
## 
## Coefficients: (2 not defined because of singularities)
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  8.281e+05  6.277e+05   1.319    0.187
## power_1     -3.832e+03  3.147e+03  -1.218    0.223
## power_2      9.360e+00  6.645e+00   1.409    0.159
## power_3     -1.159e-02  7.832e-03  -1.481    0.139
## power_4      8.601e-06  5.768e-06   1.491    0.136
## power_5     -4.092e-09  2.816e-09  -1.453    0.146
## power_6      1.298e-12  9.423e-13   1.378    0.168
## power_7     -2.799e-16  2.196e-16  -1.274    0.203
## power_8      4.111e-20  3.575e-20   1.150    0.250
## power_9     -4.063e-24  4.013e-24  -1.013    0.311
## power_10     2.606e-28  2.999e-28   0.869    0.385
## power_11    -9.991e-33  1.378e-32  -0.725    0.468
## power_12     1.822e-37  3.116e-37   0.585    0.559
## power_13            NA         NA      NA       NA
## power_14    -3.217e-47  9.860e-47  -0.326    0.744
## power_15            NA         NA      NA       NA
## 
## Residual standard error: 246500 on 17370 degrees of freedom
## Multiple R-squared:  0.5559, Adjusted R-squared:  0.5555 
## F-statistic:  1672 on 13 and 17370 DF,  p-value: < 2.2e-16

It seems pretty clear that this model is inappropriate for these data. None of the coefficients is statistically significant, two (power_13 and power_15) could not be estimated at all because the design matrix of raw powers is numerically rank-deficient, and just by eyeballing it the curve looks overfit.
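The NA rows in the coefficient table arise because lm() drops aliased columns from a rank-deficient design, which can be illustrated with a toy example (made-up data):

```r
# With an exactly collinear predictor, lm() estimates one coefficient
# and reports the aliased one as NA
set.seed(42)
d <- data.frame(x = rnorm(20))
d$x2 <- 2 * d$x               # perfectly collinear with x
d$y  <- 1 + d$x + rnorm(20)
fit <- lm(y ~ x + x2, data = d)
coef(fit)                     # the x2 coefficient is NA
```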

If we were to use a different subset of the data do you think we would get pretty much the same curve?

Actually, I don’t. The curve is overfit, so it makes sense that it would look different on a different data set. We get to test this idea in the next question:

Estimate a 15th degree polynomial on all 4 sets, plot the results and view the coefficients for all four models

subset1 <-read.csv(unzip(zipfile="./Assessing_Performance/wk3_kc_house_set_1_data.csv.zip"),
                       header = T, 
                       sep = ",", 
                       quote = " ", 
                       stringsAsFactors = T )
subset2 <-read.csv(unzip(zipfile="./Assessing_Performance/wk3_kc_house_set_2_data.csv.zip"),
                       header = T, 
                       sep = ",", 
                       quote = " ", 
                       stringsAsFactors = T )
subset3 <-read.csv(unzip(zipfile="./Assessing_Performance/wk3_kc_house_set_3_data.csv.zip"),
                       header = T, 
                       sep = ",", 
                       quote = " ", 
                       stringsAsFactors = T )
subset4 <-read.csv(unzip(zipfile="./Assessing_Performance/wk3_kc_house_set_4_data.csv.zip"),
                       header = T, 
                       sep = ",", 
                       quote = " ", 
                       stringsAsFactors = T )

All four subsets had their column classes amended as before.

c(dim(subset1), dim(subset2), dim(subset3), dim(subset4))
## [1] 5404   21 5398   21 5409   21 5402   21
sales <- tie_Break(dataframe = subset1, 
                   feature = "sqft_living", 
                   output = "price")
subset1_data <- poly_dataframe(dataframe = sales, 
                             feature = "sqft_living", 
                             degree = 15)
subset1_data <- cbind(subset1_data, sales$price)
colnames(subset1_data)[16] <- "price"
subset1_model <- lm(formula = price~., data = subset1_data)
plot(x = subset1_data$power_1, 
     y = subset1_data$price, 
     xlab = "Square Feet", 
     ylab = "House Price", 
     main = "15th degree polynomial fit to subset 1")
points(x = subset1_data$power_1, 
       y = fitted(subset1_model), 
       col="red", 
       pch = 20, 
       type = "l", 
       lwd=3)
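The remaining three subsets go through exactly the same steps; a compact sketch that bundles them into one helper (fit_subset is a hypothetical name, not part of the original pipeline):

```r
# Sort by sqft_living (ties broken by price), build raw powers up to
# the given degree, and fit an lm() model, mirroring the subset1 code
fit_subset <- function(df, degree = 15){
        df <- df[order(df$sqft_living, df$price), ]
        poly_df <- as.data.frame(sapply(1:degree, function(i) df$sqft_living^i))
        colnames(poly_df) <- paste0("power_", 1:degree)
        poly_df$price <- df$price
        lm(price ~ ., data = poly_df)
}

# subset2_model <- fit_subset(subset2)
# subset3_model <- fit_subset(subset3)
# subset4_model <- fit_subset(subset4)
```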

It is already clear that the above curve differs significantly from the 15th degree model previously fit.

Quiz Question:

Is the sign (positive or negative) for power_15 the same in all four models?

subset1_model$coefficients[16]
## power_15 
##       NA
subset2_model$coefficients[16]
## power_15 
##       NA
subset3_model$coefficients[16]
## power_15 
##       NA
subset4_model$coefficients[16]
##     power_15 
## -1.32673e-47

Three of the four models do not even assign a weight to power_15 (the term is dropped as aliased); only the model for subset 4 gives it a coefficient, which is negative.

Quiz Question:

True/False: the plotted fitted lines look the same in all four plots.

Obviously this is not the case.

Download the provided csv files for training, validation and test data

train_data <-read.csv(unzip(zipfile="./Assessing_Performance/wk3_kc_house_train_data.csv.zip"),
                       header = T, 
                       sep = ",", 
                       quote = " ", 
                       stringsAsFactors = T )

validation_data <-read.csv(unzip(zipfile="./Assessing_Performance/wk3_kc_house_valid_data.csv.zip"),
                       header = T, 
                       sep = ",", 
                       quote = " ", 
                       stringsAsFactors = T )

test_data <-read.csv(unzip(zipfile="./Assessing_Performance/wk3_kc_house_test_data.csv.zip"),
                       header = T, 
                       sep = ",", 
                       quote = " ", 
                       stringsAsFactors = T )

Now, for each degree from 1 to 15, fit a polynomial model on the training data, plot the fit, and compute the RSS on the validation data. Because poly() appears inside the formula, predict() automatically reconstructs the same basis for the validation data:

sales <- tie_Break(dataframe = train_data, 
                   feature = "sqft_living", 
                   output = "price")
RSS<-vector()
for (i in 1:15){
        model<-lm(formula = price~poly(sqft_living, i),
                  data =  sales)
        plot(x = sales$sqft_living, y = sales$price, xlab = "Square Feet Living", ylab = "House Price", main = paste("Polynomial fit to the training data of degree", i, sep = " "))
        points(x = sales$sqft_living, 
               y = fitted(model), 
               type = "l", 
               lwd=3, 
               col="red")
        predictions<-predict(object = model, 
                             newdata = validation_data)
        residuals<-predictions-validation_data$price
        RSS[i]<-sum(residuals^2)
        
}

which.min(RSS)
## [1] 5

Quiz Question:

Which degree (1, 2, …, 15) had the lowest RSS on Validation data?

This is given by the model with degree 5

Now that you have selected a degree compute the RSS on TEST data for the model with the best degree from the Validation data.

model <- lm(formula = price ~ poly(sqft_living, 5),
            data = sales)
predictions<-predict(object = model, 
                             newdata = test_data)
residuals<-predictions-test_data$price
RSS<-sum(residuals^2)

Quiz Question:

what is the RSS on TEST data for the model with the degree selected from Validation data? (Make sure you got the correct degree from the previous question)

This is approximately 1.35586 × 10^14.

sales <- tie_Break(dataframe = train_data, 
                   feature = "sqft_living", 
                   output = "price")
validation<- tie_Break(dataframe = validation_data, feature = "sqft_living", output = "price")

RSS<-vector()
for (i in 1:15){
        polydf<-poly_dataframe(dataframe = sales, 
                               feature = "sqft_living", 
                               degree = i)
        validation_df<-poly_dataframe(dataframe = validation, 
                               feature = "sqft_living", 
                               degree = i)
        polydf<-cbind(polydf, sales$price)
        validation_df<-cbind(validation_df, validation$price)
        colnames(polydf)[ncol(polydf)]<-"price"
        colnames(validation_df)[ncol(validation_df)]<-"price"
        model<-lm(price~., data = polydf)
        predictions<-predict(object = model, newdata = validation_df)
        residuals<-validation_df$price-predictions
        RSS[i]<-sum(residuals^2)
}
## Warning in predict.lm(object = model, newdata = validation_df): prediction
## from a rank-deficient fit may be misleading
## Warning in predict.lm(object = model, newdata = validation_df): prediction
## from a rank-deficient fit may be misleading
## Warning in predict.lm(object = model, newdata = validation_df): prediction
## from a rank-deficient fit may be misleading
which.min(RSS)
## [1] 5

The rank-deficiency warnings come from the highest-degree fits, in which lm() dropped aliased power terms. Reassuringly, this manual approach agrees with poly(): degree 5 minimizes the validation RSS.